Day23-urllib3-Advanced Usage-1

11th iThome Ironman Contest

DAY 23

Self-Challenge Group

Why it works: python requests and urllib3, part 23 of the series

j2hongming

2019-10-09 22:24:39

Customizing pool behavior

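For each host, PoolManager creates and manages a ConnectionPool. By default it keeps up to 10 pools. If you need to reach more hosts than that, raise the limit with num_pools; the trade-off is higher memory and socket consumption.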

import urllib3
http = urllib3.PoolManager(num_pools=50)
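To make the num_pools limit visible, here is a minimal sketch that deliberately uses a tiny limit. The host names are placeholders, not part of the original example, and it relies on my understanding that PoolManager drops the least recently used pool once the limit is exceeded. Creating a pool does not open a connection, so nothing is sent over the network.

```python
import urllib3

# Keep at most 2 pools so that eviction is easy to observe.
http = urllib3.PoolManager(num_pools=2)

# Placeholder hosts; building a pool does not connect to them.
for host in ('a.example', 'b.example', 'c.example'):
    http.connection_from_host(host, port=80, scheme='http')

# Three hosts were requested, but only num_pools pools are kept.
pool_count = len(http.pools)
print(pool_count)
```

With the default of 10, the same thing happens on the 11th distinct host, which is why a crawler that touches many hosts may want a larger num_pools.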

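A ConnectionPool manages a set of HTTPConnection objects; when a request completes, its HTTPConnection goes back into the pool. By default a pool keeps only 1 connection. If you need to send several requests to the same host at the same time, raise this with maxsize.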

http = urllib3.PoolManager(maxsize=10)

maxsize – Number of connections to save that can be reused.
More than 1 is useful in multithreaded situations. If block is set to False, more connections will be created but they will not be saved once they’ve been used.
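The relationship between PoolManager, hosts and maxsize can be checked without sending a request. The sketch below assumes that connection_from_url only builds the pool object, and it peeks at pool.pool, the queue that holds reusable connections. That attribute is an implementation detail and is used here only for inspection.

```python
import urllib3

http = urllib3.PoolManager(maxsize=10)

# Two URLs with the same scheme, host and port share one ConnectionPool.
pool = http.connection_from_url(
    'https://www.youtube.com/results?search_query=taiwan')
same_pool = http.connection_from_url(
    'https://www.youtube.com/results?search_query=news')

pool_type = type(pool).__name__
print(pool_type)
print(pool is same_pool)

# maxsize given to PoolManager is passed down to every pool it creates.
print(pool.pool.maxsize)
```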

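Setting only maxsize caps how many connections are kept for reuse. If more requests are issued at once than that number, extra connections are still created; they are simply discarded after use instead of going back into the pool.
To prevent this, pass block=True. The number of connections per host is then limited to maxsize, and additional requests wait until a connection is free.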

import concurrent.futures
import urllib3


URLS = [
    'https://www.youtube.com/results?search_query=taiwan',
    'https://www.youtube.com/results?search_query=news',
    'https://www.youtube.com/results?search_query=weather',
    'https://www.youtube.com/results?search_query=mayday',
    'https://www.youtube.com/results?search_query=serious',
    'https://www.youtube.com/results?search_query=serious+music',
    'https://www.youtube.com/results?search_query=taiwan+weather',
    'https://www.youtube.com/results?search_query=power',
    'https://www.youtube.com/results?search_query=giant+human',
    'https://www.youtube.com/results?search_query=joker'
]


def youtube_it(http, url):
    # Fetch one page through the shared PoolManager and return the decoded body
    r = http.request('GET', url)
    return r.data.decode('utf-8')


def query(http):
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        future_to_url = {executor.submit(youtube_it, http, url): url for url in URLS}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                data = future.result()
            except Exception as exc:
                print('%r generated an exception: %s' % (url, exc))
            else:
                print('%r page length is %d' % (url, len(data)))


# block=True caps the pool at 5 connections per host; the other threads wait for a free one
http = urllib3.PoolManager(maxsize=5, block=True)
query(http)
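Against a remote site it is hard to see that block=True really limits concurrency. The following is a sketch under my own assumptions, not part of the original example: a throwaway local server (SlowHandler, stats and fetch are hypothetical names) counts how many requests it is handling at once while 6 threads share a pool with maxsize=2.

```python
import concurrent.futures
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

import urllib3

lock = threading.Lock()
stats = {'in_flight': 0, 'peak': 0}


class SlowHandler(BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.1'  # keep connections alive so they can be reused

    def do_GET(self):
        with lock:
            stats['in_flight'] += 1
            stats['peak'] = max(stats['peak'], stats['in_flight'])
        time.sleep(0.05)  # hold the request open so that requests overlap
        with lock:
            stats['in_flight'] -= 1
        body = b'ok'
        self.send_response(200)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


# Port 0 asks the OS for any free port.
server = ThreadingHTTPServer(('127.0.0.1', 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_address[1]

http = urllib3.PoolManager(maxsize=2, block=True)


def fetch(_):
    return http.request('GET', url).status


# 6 worker threads compete for the 2 pooled connections.
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    statuses = list(executor.map(fetch, range(6)))

http.clear()  # close the pooled client connections
server.shutdown()
server.server_close()

print(statuses)
print('peak concurrent requests:', stats['peak'])
```

All 6 requests succeed, but the server should never see more than 2 at a time. Changing block=True to block=False lets the peak rise above maxsize, which is the behavior described in the maxsize note above.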

Streaming and IO

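When dealing with large responses it’s often better to stream the response content.

Pass preload_content=False to the request so that the body is not read into memory up front. Then read it in chunks with r.stream() and call r.release_conn() when you are done.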

import urllib3
http = urllib3.PoolManager()


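def stream_download(http, url):
    # preload_content=False defers reading the body until we ask for it
    r = http.request('GET', url, preload_content=False)
    with open('unsplash.jpg', 'wb+') as f:
        print('headers: {}'.format(r.headers))
        # Read the body 32 bytes at a time and write each chunk to disk
        for chunk in r.stream(32):
            print('length: {}'.format(len(chunk)))
            print('data: {}'.format(chunk))
            f.write(chunk)
    # Hand the connection back to the pool once the body is consumed
    r.release_conn()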


def non_stream_download(http, url):
    # Default behavior: the whole body is loaded into r.data before request() returns
    r = http.request('GET', url)
    with open('unsplash.jpg', 'wb+') as f:
        print('headers: {}'.format(r.headers))
        print('length: {}'.format(len(r.data)))
        f.write(r.data)


url = 'https://images.unsplash.com/photo-1570475754561-4effe71c5084?ixlib=rb-1.2.1&q=85&fm=jpg&crop=entropy&cs=srgb&dl=pawel-czerwinski-IXgSpDrxsgM-unsplash.jpg'
# Call stream_download(http, url) instead to write the file chunk by chunk
non_stream_download(http, url)
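To look at the chunking of r.stream() without downloading anything, the sketch below builds a urllib3.HTTPResponse by hand around an in-memory body. Constructing the response directly is a testing convenience I am assuming here, not something the original example does; payload and out are stand-ins for the image data and the output file.

```python
import io

import urllib3

payload = bytes(range(100))  # 100 bytes standing in for the downloaded data

# preload_content=False leaves the body unread, as in
# http.request('GET', url, preload_content=False).
r = urllib3.HTTPResponse(body=io.BytesIO(payload), status=200,
                         preload_content=False)

out = io.BytesIO()  # stands in for the open file
chunk_sizes = []
for chunk in r.stream(32):
    chunk_sizes.append(len(chunk))
    out.write(chunk)

# Nothing to release for a hand-built response, but harmless to call.
r.release_conn()

print(chunk_sizes)
print(out.getvalue() == payload)
```

Each chunk is at most 32 bytes, so memory use stays bounded by the chunk size rather than by the size of the response. For a real download a larger chunk size than 32 is probably a better fit.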